There are two things we need to do before we start the lesson.
Install R: Any of the links on that page should work
Install RStudio: Choose the option that is suited to your operating system (OS).
Get familiar with your working environment
When you installed R on your computer, you provided it with the tools necessary to process code that is written in R. The base R installation ships with a simple code editor that can be used to write R scripts. RStudio is a software application that builds on top of your R installation to give you additional functionality for writing and managing R code and projects. It is by far the most popular integrated development environment (IDE) for R and has several features that are custom-made to support data science in R.
There are four different panes on RStudio that you need to get familiar with.
The console: This is the pane that is usually located on the left (at the bottom if you have a script or notebook open) and can be used to type in R commands. The console is a great place to test out lines of code that don't necessarily have to go into the project you are working on. The console is also the place where the output of your code is printed [2]. The console, however, is not the place to write any code that you need to rerun, since code typed into the console is not saved as a file. You will need to use the code pane to do that.
The code: This is where you will write the majority of the code for your projects. There are several different formats in which you can write code. This pane only appears if you have an open file that you are currently working on. Click on the button at the top left of RStudio to see the different options. We will use the first two options, R Notebook and R Script, for most of this course.
The viewer: The pane at the bottom right is where you can view plots rendered by your code, see the packages that are currently loaded into the environment, access help files, view documents, and browse folders on your computer. Click on the Packages tab and scroll through the list to see all the packages that are checked. These are the packages that are pre-loaded with your session. If you don't know what packages are, don't worry, we will be covering that very soon.
The environment: The environment pane is the one at the top right. This is where all the variables and functions that you create are displayed. RStudio provides additional functionality that allows users to click on these variables to view their contents. Click on the Global Environment dropdown and you will notice that several packages are listed there. These are the same as those that are checked in the Packages tab on the viewer pane. Selecting one of them will display the entire list of functions that are currently available for you to use.
⚡Ninja Tasks⚡
Create a Numbers Ninja Folder on your local machine
Create an R Notebook “Class_1” and save it in this folder
In computer text processing, a markup language is a system for annotating a document in a way that is syntactically distinguishable from the text…the whole idea of a markup language is to avoid the formatting work for the text, as the tags in the markup language serve the purpose to format the appropriate text (like a header or beginning of a next para…etc.). Every tag used in a markup language has a property to format the text we write.
HyperText Markup Language (HTML) is perhaps the most well-known markup language out there [3]. It is used to render content on websites. Go ahead and right-click on any webpage and click on Inspect to open the web inspector and view the underlying HTML code that was used to generate it. Markdown is a more human-friendly (but limited) alternative to HTML. Markdown uses simple, easy-to-read tags to mark up text, which is then compiled (by your computer) into HTML that can be used to render webpages. This makes it easier for those who are unfamiliar with HTML to create content that can be rendered on a webpage.
R Markdown is a package in R that allows us to combine markdown with R code. Using R Markdown we can create a wide variety of easy-to-read, reproducible R documents that contain both code and nicely formatted text. This online R notebook is a case in point.
In addition to using markdown to mark up text, R Markdown documents also use YAML. This is the section right at the top of the notebook that is delimited by ---. This is the part of the document where you specify the title and other characteristics of the report.
Use all the shortcuts to do the following task: create a new code chunk => assign the string “Hello World” to myFirstVar variable => run the line of code => add a print command using print(myFirstVar) => run the entire chunk
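As a rough guide, the finished chunk for the task above might contain something like the following. The variable name and the string come from the task itself; the chunk delimiters that RStudio inserts are left out here.

```r
# assign the string "Hello World" to the variable myFirstVar
myFirstVar <- "Hello World"

# print the contents of the variable
print(myFirstVar)
```

Running the first line on its own only creates the variable; the output appears once the print() line (or the whole chunk) is run.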
Packages
Packages make our job as coders easier by extending the functionality of base R. Each package contains a collection of functions that we can use in our code without having to worry about writing and maintaining them ourselves. In addition to functions, packages can also contain data or point to a database API (such as wbstats, which points to the World Bank's data). R has an extremely rich, well-maintained ecosystem of packages that contain functions and data relevant to almost any academic (or even trivial) topic of interest.
⚡Ninja Tasks⚡
Install the following packages - tidyverse and nycflights13 using the package tab in the viewer pane (use the button)
Use the library() function to load these two packages in the Class 1 notebook
Check the packages tab on your viewer to see if these packages are checked (i.e. loaded to our current environment)
Check the Global Environment dropdown to see that these are listed.
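The same checks can also be done from code rather than by clicking through the panes. The sketch below is one possible way to do it; the helper variable names (pkgs, notInstalled) are made up for illustration, and the install step is left commented out so that nothing is downloaded by accident.

```r
# packages named in the tasks above
pkgs <- c("tidyverse", "nycflights13")

# which of them are not installed yet?
notInstalled <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# uncomment to install from code instead of using the Packages tab
# install.packages(notInstalled)

# load whichever of the packages are available
for (p in setdiff(pkgs, notInstalled)) {
  library(p, character.only = TRUE)
}

# TRUE for each package that is now attached to the session
paste0("package:", pkgs) %in% search()
```

The search() function lists what is attached to the current session, which is the same information the Global Environment dropdown shows.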
Tidyverse
The tidyverse is a collection of packages that follow the tidy tools philosophy.
The Tidy Tools Manifesto by Hadley Wickham describes the four basic principles of a tidy API as follows:
Reuse existing data structures.
Compose simple functions with the pipe.
Embrace functional programming.
Design for humans.
There are two overarching programming concepts here that are important for us to note: composition and the use of declarative instead of imperative code. Composition is simply the use of several smaller functions to create a more complex function. The tidyverse achieves composition through the use of pipes (%>%). We will learn about pipes relatively soon.
In computer science, declarative programming is a programming paradigm—a style of building the structure and elements of computer programs—that expresses the logic of a computation without describing its control flow.
This video explains this further using the real-world example of a car (albeit from the perspective of React, a JavaScript library).
The tidyverse packages are characterized by their extensive use of pipes to break down complex functions into simpler pieces that are easier to read and understand. We will learn about pipes shortly.
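To make the distinction concrete in R, here is a small base R sketch (not taken from the video) that adds up the even numbers from 0 to 200 in both styles. The variable names are chosen only for illustration.

```r
numbers <- 0:200

# imperative style: spell out the control flow step by step
total <- 0
for (n in numbers) {
  if (n %% 2 == 0) {
    total <- total + n
  }
}

# declarative style: state what is wanted and let R handle the looping
totalDeclarative <- sum(numbers[numbers %% 2 == 0])

# both approaches arrive at the same value
total == totalDeclarative
```

The declarative version says what result is wanted (the sum of the even elements) and leaves the iteration to R.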
We will be using the nycflights13 datasets package to learn data manipulation. The package contains relational datasets that offer information on all the flights that took off from 3 airports in NYC (EWR, JFK, LGA) in 2013. The different datasets in the package are as follows:
airlines: Names of the different airline carriers
airports: Metadata on the airports in the dataset
flights: Data on all the flights that departed NYC in 2013
planes: Metadata on the planes (their make, capacity, etc.)
weather: Weather at the three NYC airports on all days of 2013
For this notebook we will be using only the flights data.
Data Manipulation using dplyr
dplyr is a package in the tidyverse collection. It provides elegant and easy to understand functions for manipulating data in R.
Keep this data wrangling cheatsheet open on your browser as a ready reference for the rest of the notebook.
Tibbles
A tibble is a tidy version of a data.frame. It adds a few extra features (through two new classes) that give it a few advantages over data.frames.
⚡Ninja Tasks⚡
Create a tibble (of your choice) with three observations and three variables.
Show the class and structure of the tibble you created.
Convert the mtcars data set to a tibble.
🏆Solution🏆
Let's create a tibble using the tibble() command. Check out the code below. Each input to the tibble() command is a column. Notice how I have used both column vectors created within tibble() and an externally created vector, hobby, inside the command.
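A minimal sketch of such a tibble is shown here. The names myFavoriteThings and hobby come from the text; the other column names and all of the values are invented for illustration.

```r
library(tibble)

# a vector created outside of tibble()
hobby <- c("reading", "cycling", "cooking")

# each argument to tibble() becomes a column
myFavoriteThings <- tibble(
  person = c("Asha", "Ben", "Chen"),   # created inside tibble()
  food = c("dosa", "pizza", "noodles"), # created inside tibble()
  hobby = hobby                         # created beforehand
)

# show the structure of the tibble
str(myFavoriteThings)
```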
The tidyverse also provides a function called glimpse() to do the same thing as str(). Notice the differences between the outputs from the two commands.
You can access the class of any object in R using the class() command. Notice how there are three classes for a tibble: “tbl_df”, “tbl” and “data.frame”. Tibbles really are just data.frames with some added functionality on top that is provided by the tbl and tbl_df classes.
class(myFavoriteThings)
[1] "tbl_df" "tbl" "data.frame"
mtcars is a part of the datasets package that is preloaded in R. Let's first look at its structure using str().
As you can see, the class of mtcars is data.frame and not a tibble. We can convert it to a tibble using the as_tibble() command. Below, I have used this command to create a new tibble called mtTibble. The output shows its structure. Notice that we now have the two additional classes “tbl” and “tbl_df” that characterize tibbles.
mtTibble <- as_tibble(mtcars)
str(mtTibble)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The Pipe Operator
The pipe operator (%>%) is a part of the magrittr package. You can insert one using the shortcuts you learnt earlier (Cmd + Shift + M or Ctrl + Shift + M). The pipe passes the value on its left to the function on its right, by default as that function's first argument. While this might seem a bit abstract at this point, %>% is an excellent way to break down code into smaller pieces that are easier to read and maintain.
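A small sketch may make this less abstract. It compares a nested call with its piped equivalent; the variable names are made up for illustration.

```r
library(magrittr)

values <- c(1, 4, 9)

# nested call: has to be read from the inside out
nested <- sum(sqrt(values))

# piped call: reads from left to right, one step at a time
piped <- values %>% sqrt() %>% sum()

# both forms compute the same value
nested == piped
```

Loading the tidyverse also makes %>% available, so the explicit library(magrittr) call is only needed when running this snippet on its own.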
⚡Ninja Tasks⚡
Calculate the square root of the sum of all even numbers from 0 to 200 and create a sequence that goes from the square root to twice its value.
Repeat the task using base commands.
🏆Solution🏆
First, let's do it using pipes. Notice the use of the dot on the last line of the code. When using pipes, the . is used to refer explicitly to the data on the left side of the pipe. We do this so that we can perform an operation on it. In our case, we need to create a sequence that goes from the left-hand side value to twice its value, and we achieve this using . * 2. Note that we could have used the dot notation to refer to the data on the left-hand side explicitly for each command; however, the pipe automatically passes the left-hand side as the first argument of the next function, saving us the trouble. We only need to use a . if we need to perform an operation on the value or place it explicitly as a function argument.
## create a sequence of even numbers from 0 to 200
seq(from = 0, to = 200, by = 2) %>%
  ## calculate the sum of all values in the previous vector
  sum() %>%
  ## calculate the square root of the value from the sum operation
  sqrt() %>%
  ## create a new seq from the previous value to twice its value
  seq(from = ., to = . * 2, by = 2)
Now let's do this without the pipes. Notice how the base command combines multiple steps, i.e. sum(), sqrt() and seq(), into a single nested command. Code such as this is harder to read, understand and debug compared to the version that uses %>%.
## create a sequence of even numbers, calculate its sum and find the square root
sqrtValue <- sqrt(sum(seq(from = 0, to = 200, by = 2)))
## create a sequence from the square root to twice its value
seq(from = sqrtValue, to = 2 * sqrtValue, by = 2)
Filter
Filter is the dplyr verb for subsetting rows of data based on a particular condition.
⚡Ninja Tasks⚡
Which flight out of JFK was the most delayed in 2013?
Answer the previous question without using dplyr or pipes.
What were the 5 longest flights (in air time) from NYC in 2013?
🏆Solution🏆
Let's find the most delayed flight out of JFK.
flights %>%
  ## filter all flights from JFK
  filter(origin == "JFK") %>%
  ## find the flight that had the maximum departure delay
  filter(dep_delay == max(dep_delay, na.rm = TRUE))
Let's try to do this using base R. Notice how the code is far less readable in this case.
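One possible base R version is sketched here. To keep the snippet runnable on its own, it uses a small made-up data frame, toyFlights, that borrows a few column names from flights; swapping in flights itself should work the same way once nycflights13 is loaded.

```r
# hypothetical stand-in for a handful of rows from flights
toyFlights <- data.frame(
  origin = c("JFK", "JFK", "EWR", "LGA", "JFK"),
  flight = c(51, 11, 1545, 461, 725),
  dep_delay = c(1301, NA, 2, -6, -1)
)

# keep only the rows for flights that left from JFK
jfkFlights <- toyFlights[toyFlights$origin == "JFK", ]

# keep the row(s) whose departure delay equals the maximum, ignoring NAs
mostDelayed <- jfkFlights[
  which(jfkFlights$dep_delay == max(jfkFlights$dep_delay, na.rm = TRUE)),
]
mostDelayed
```

The which() call drops the rows where the comparison is NA, something filter() does automatically.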
dplyr also ships with helper functions. We can use one of these, top_n(), to find the flights that had the highest air times.
flights %>% top_n(n = 5, wt = air_time)
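In dplyr 1.0.0 and later, top_n() is marked as superseded in favour of slice_max(). The sketch below shows the newer function on a small made-up tibble (toyFlights and its values are invented for illustration), assuming a dplyr version that provides slice_max().

```r
library(dplyr)

# hypothetical stand-in for a few rows of flights
toyFlights <- tibble(
  flight = c(51, 11, 1545, 461),
  air_time = c(640, 345, 227, 150)
)

# keep the two rows with the largest air_time, largest first
longest <- toyFlights %>% slice_max(air_time, n = 2)
longest
```

Unlike top_n(), slice_max() returns the rows already sorted from highest to lowest.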
Select
Select is a dplyr verb that is used for subsetting columns.
⚡Ninja Tasks⚡
Select all the columns that are relevant for arrival and departure delays using a utility function (refer to cheat sheet)
🏆Solution🏆
We can combine the select verb with the contains() helper function to achieve this task.
flights %>% select(contains("delay"))
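The cheat sheet lists several other selection helpers. As a brief sketch on a made-up tibble (toyFlights and its values are invented for illustration), here is how contains() compares with ends_with():

```r
library(dplyr)

# hypothetical stand-in with a few of the flights column names
toyFlights <- tibble(
  carrier = c("HA", "AA"),
  dep_delay = c(1301, -3),
  arr_delay = c(1272, 8),
  air_time = c(640, 345)
)

# contains() matches the text anywhere in the column name
delayCols <- toyFlights %>% select(contains("delay"))

# ends_with() only matches at the end of the column name
sameCols <- toyFlights %>% select(ends_with("_delay"))

names(delayCols)
```

Here both helpers pick the same two columns, because every column name containing "delay" also ends in "_delay".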
Arrange
Arrange is a verb that arranges the rows based on the values of a particular column i.e. performs a sort.
⚡Ninja Tasks⚡
Filter the top 10 most delayed flights in JFK and arrange by dep_delay (highest to lowest)
🏆Solution🏆
Note that I have
flights %>%
  ## arrange dep_delay in descending order (high to low)
  arrange(desc(dep_delay)) %>%
  ## filter the first 10 rows using row_number()
  filter(row_number() <= 10) %>%
  ## select the relevant columns
  select(month, day, origin, dep_delay)
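The pipeline above ranks flights from all three airports. If only JFK departures are wanted, as the task wording suggests, one option is to add a filter on origin before sorting. The sketch below illustrates this on a small made-up tibble (toyFlights and its values are invented, and the top 2 rows are kept instead of the top 10).

```r
library(dplyr)

# hypothetical stand-in for a few rows of flights
toyFlights <- tibble(
  origin = c("JFK", "EWR", "JFK", "LGA", "JFK"),
  dep_delay = c(15, 200, 120, 30, 45)
)

topJfk <- toyFlights %>%
  ## keep only departures from JFK
  filter(origin == "JFK") %>%
  ## sort from the highest delay to the lowest
  arrange(desc(dep_delay)) %>%
  ## keep the first 2 rows
  filter(row_number() <= 2)
topJfk
```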
Mutate
Mutate changes a tibble by adding a new column vector or changing an existing one.
⚡Ninja Tasks⚡
Create a new variable called total_delay that is the sum of the arrival delay and departure delay
Do the same task using Base R
🏆Solution🏆
flights %>%
  ## combine arrival delay and departure delay
  mutate(total_delay = arr_delay + dep_delay) %>%
  ## select all the delay columns
  select(contains("delay"))
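For the base R part of the task, one possible approach is to assign the new column with $ and then pick the delay columns by name. The sketch uses a small made-up data frame (toyFlights and its values are invented) so that it runs on its own.

```r
# hypothetical stand-in for a few rows of flights
toyFlights <- data.frame(
  carrier = c("UA", "AA", "B6"),
  arr_delay = c(11, 20, NA),
  dep_delay = c(2, 4, -1)
)

# add the new column as the sum of the two delay columns
toyFlights$total_delay <- toyFlights$arr_delay + toyFlights$dep_delay

# keep only the columns whose names contain "delay"
toyFlights[, grepl("delay", names(toyFlights))]
```

Note that the $ assignment modifies toyFlights in place, whereas mutate() returns a new tibble and leaves the original untouched unless the result is assigned back.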
[2] This is true for all formats other than those that use R Markdown. In formats that use R Markdown, the output is printed within the document rather than in the console.
[3] Some other examples include TeX and LaTeX. TeX is a markup language that is used as a typesetting system for high-quality books. LaTeX is used extensively in academia to create documents.